-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
redfishpower: adapt status polling interval #167
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would an exponential backoff with a slower start allow you to avoid the need for the more logic based approach you've got ? E.g.
#include <stdio.h>
int main(int argc, char **argv)
{
for (double d = 1.0; d < 10; d *= 1.3)
printf ("%.2f\n", d);
}
which gets you
1.00
1.30
1.69
2.20
2.86
3.71
4.83
6.27
8.16
It seems like a fixed polling interval is unlikely to be anybody's choice and we could skip that?
Thoughts?
It was half added just as an "emergency" in case some random board we come upon doesn't quite meet the adapted algorithm I came up with. That commit could be removed, it isn't super necessary./
hmmmm, I guess that could work. I guess on the boards I've seen that are 47-50 seconds, I'm just afraid of the exponential backoff growing too large around that point in time, i.e. the exponential backoff has grown to 10+ seconds. In your example it would have grown to ~13 seconds and the overall wait would be 56 seconds. With my current adapted one we'd wait 48 or 52 seconds. I think it's sort of a draw? The adapted one was calculated specifically for the cases I've seen. |
Oh I was figuring you would cap the delay at 10s or so (or whatever makes sense). If we need to make it tunable, then I would add a command that lets you sets the start, multiplicative factor, and cap instead of only allowing only a non-adaptive delay to be configured. But I'm not sure its needed. |
ahhh ok. Then perhaps capping it at ... ehhh 5 seconds or so should be a good balance. maybe the multiplicative would be like 1.1 or 1.2 then. I'll play around with it. |
fed746e
to
a3a211b
Compare
re-pushed, removing the I ended up just keeping the "logic based" exponential backoff, b/c doing the 1.3X (or similar) multiplier ended up not quite as logically simple as we would have thought (i.e. "if this is the first poll, it is 1 second", "if this poll is > X seconds, cap at X seconds", etc. logic added just as much logic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I made the same comment multiple times - we just need to document the specific hardware that this algorithm was developed for IMHO.
src/redfishpower/redfishpower.c
Outdated
* shows wait ranges from a few seconds to 20 seconds | ||
* shows wait for on/off to complete ranges from a few seconds to 50 | ||
* seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this test is very hardware specific the specific hardware/firmware should be mentioned.
* timeout - when the overall power command times out | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit message should say what hardware the observations were based on and perhaps have a bit more detail about the observations themselves? Without that context, when some hardware that behaves differently comes a long, we can't make an informed decision about how to move forward.
src/redfishpower/redfishpower.c
Outdated
/* testing a range of hardware shows that the amount of time it | ||
* takes to complete an on/off falls into two bands. Either it | ||
* completes in the 3-6 second range OR it takes 20-50 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about hardware specifics.
a3a211b
to
ebb1c36
Compare
Problem: The status polling interval is hard coded to 1 second long. This can result in an excessive number of polling messages being sent when it is known that some hardware takes 20-60 seconds to complete a power operation. As an example, on an HPE Cray Supercomputing EX chassis, the power on of a node takes around 50 seconds, while a power off takes around 6 seconds. Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long it typically takes to complete power operations on hardware. The status polling interval begins at one second, but it gets capped at 4 seconds.
ebb1c36
to
2d295d9
Compare
re-pushed
|
Perfect - thanks! |
Problem: The status polling interval is hard coded to 1 second long. This can result in an excessive number of polling messages being sent when it is known that some hardware takes 20-50 seconds to complete a power operation.
Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long it typically takes to complete power operations on hardware. The status polling interval begins at one second, but it gets capped at 4 seconds.